Languguage OS 2 / Languguage OS II Version 10-94 (Knowledge Media)(1994).ISO / language / condor / condor-w.arc
Text File | 1994-10-18 | 27KB | 621 lines
Subject: Condor mailing list
From: miron@chevre.cs.wisc.edu (Miron Livny)
Date: Tue, 29 Dec 92 19:39:23 -0600
We are proud to announce the birth of yet another mailing list -
the condor_world mailing list (condor-world@cs.wisc.edu).
As all of us know, mailing lists are a mixed blessing. We were
playing with the idea of starting such a list for more than a year.
So far we have been able to control our desire to take advantage
of the unutilized capacity of the Internet to send Condor
related mail. However, the interest in Condor and the number
of active Condor pools have reached a point where we could not
say no any more. The bad part of the news is that not only did
we establish such a list, we included your name in the list.
You are on the list because at one time or another you expressed
interest in Condor, and (maybe) asked to be on the Condor
mailing list. There is also a good part to the story: you can
get off the list. Just send a note to "owner-condor-world@cs.wisc.edu".
Condor_world should serve as a means to exchange information and
experience relating to Condor. We would also like to use it
as a channel for comments, wishes and complaints regarding the
system. As we continue to work on the problem of batch processing
in a cluster of workstations, we would like to know what *you* think
about the system. Maybe the best way to get started would be for
each of you to tell us why and how you got involved with batch
processing on a cluster of workstations.
Mike Litzkow and Miron Livny.
=================================================================
Subject: Apology
From: condor (Mike Litzkow)
Date: Tue, 05 Jan 93 10:58:06 -0600
Dear Colleagues,
First I want to apologize to all of you who have been receiving
multiple copies of items sent to the "condor-world" mailing list.
We are working on the problem, and hopefully you will only get
one copy of this message.
Also I would like to point out that "condor-world" is an unmoderated
mailing list. Every message sent to "condor-world" is rebroadcast
to the whole list. This makes it inappropriate to send requests
to get on or off the list or notes about technical problems like
receiving multiple copies to "condor-world". Please use
"condor-world-request" for any messages you do not want to broadcast
to the whole group.
best regards,
-- mike
=================================================================
Subject: Condor on HP Snakes
From: condor (Mike Litzkow)
Date: Fri, 22 Jan 93 13:57:45 -0600
Dear Colleagues,
An alpha test version of Condor for HP 700 series machines (Snakes), is
now available on our ftp server "ftp.cs.wisc.edu". This code is
running on a few machines in our local environment, but is largely
untested.
Bugs
We do know of one "bug" already. HP executables submitted to
the condor pool from other platforms will not work because of
incompatibilities between the system call sets defined by HP-UX
and other UNIX variants. HP executables submitted from HP's
should work.
Hints for Building
For building condor, both the "imake" and the "cpp" which came
with your system should be fine. We don't recommend using the
versions supplied in the "imake_tools" directory. The shell
script "mdepend.sh" in the "GENERIC" directory will be
needed. Don't use the version of "makedepend" that might
have come with your system.
best regards,
-- mike
P.S. I will be at the USENIX conference in San Diego most of next week.
I will not be answering mail during that time, but if any of you plan
to be there and would like to look me up in person, please do so. I
am staying at the conference hotel.
=================================================================
Subject: Condor on Silicon Graphics Workstations
From: condor (Mike Litzkow)
Date: Mon, 01 Mar 93 11:54:00 -0600
Friends,
An "alpha" release of Condor for the Silicon Graphics workstations running
IRIX 4.X is now available. This has been tested at two sites on IRIX 4.0.5.F
systems, but will probably run on other IRIX 4 systems as well. This
release is called "Condor_4.1.irix.alpha" and is available by anonymous
ftp from "ftp.cs.wisc.edu" as usual. We are interested in feedback on
your experiences with building, installing, and using this system. If you
have problems, let me know and I will try to help - but please be patient.
We do not have an SGI machine here which we can place in our Condor pool
for extensive testing, so we are completely dependent on the good will
of others for this work. A few notes which should help with the build
process follow.
cheers,
-- mike
Notes:
1. Please make a "condor" user and place a "CONDOR" directory
in condor's home directory on your build machine. Extract
the tar file there.
2. You will need to use "imake" in the build process. You
should use the imake already on your system for this; don't
build it from my source. I think you will find this in
"/usr/bin/X11/imake". Imake will use "cpp" to do part of its
work. The particular version of cpp used can be altered by
setting an environment variable called "IMAKECPP". The SGI
supplied "cpp" is fine, so don't set this environment
variable. The installation instructions tell you to set up an
alias for "imake". Make sure you do that.
3. You will need to use a "make depend" program in the build
process. Don't use the "makedepend" supplied in the X
distribution. Use the shell script in the CONDOR/GENERIC
directory. You can do this by setting
#define MkDepend $(TOP)/GENERIC/mdepend.sh in your
"config/SGI_IRIX405.cf" file.
4. On SGI systems (and possibly others) you can use an environment
variable (SHELL) to control which shell will be used by "make".
The Condor Makefiles expect this to be the bourne shell "/bin/sh".
Either "unsetenv" this variable or set it to /bin/sh during
your Condor building.
==============================================================
Subject: Condor on Silicon Graphics
From: condor (Mike Litzkow)
Date: Wed, 26 May 93 09:31:27 -0600
Friends,
Our alpha test of Condor on Silicon Graphics 4.0.5 machines has turned up
a few problems. It seems that due to differences in the compiler technology,
the checkpointing mechanism works only on some versions of these
machines. Feedback so far indicates that the alpha code is working
on the IP7 and IP20 systems, but not on the IP12 and IP17s. You can
determine the type of system you have by running "uname -a". I am sorry,
but I don't know the mapping between the common names like "Indigo" and
"Crimson" and the "IP" designations.
The alpha code is still available from "ftp.cs.wisc.edu" for those of
you who can use it or would like to play with it. It is unlikely that
we will be able to produce an improved version soon.
A few folks have asked about running older versions of condor which had
some code for IRIX 3.3.1, but that was never official and I believe
converting it to work with the IRIX 4* systems would be a very big task.
regards,
-- mike
==============================================================
Subject: Sun Compatibility Problems
From: condor (Mike Litzkow)
Date: Fri, 04 Jun 93 16:20:42 -0600
Friends,
We have recently discovered some incompatibilities between executables
built on sparc 10's and other sparc based Suns. You can determine the
specific types of your Suns by running "uname -m". The sparc 10's will
say "sun4m" while the others will say either "sun4" or "sun4c". If all
of your Suns are of the same category, then the problems described here
won't affect you.
Problem 1:
Condor executables built on Sparc 10s cannot run and checkpoint
properly on other sparcs. Similarly, executables built on other Suns
will not run and checkpoint properly on Sparc 10 systems. The cause is
a difference in where the user stack is placed in memory on the two
kinds of machines. The usual symptom is that the user process dies
with a segmentation fault (signal 11).
If you need to run both sparc 10's and other sparcs in the same Condor
pool, you will need to arrange for Condor to view them as two different
machine architectures. This can be done by changing the "ARCH" macro
in your "condor_config" or "condor_config.local" files. I would
suggest setting ARCH to "sun4m" on sparc 10's and "sun4" on the
others. Also the condor libraries will need to be compiled separately
for the two varieties of machines and distributed in a way which will
make the appropriate libraries visible on the proper machines.
I believe it is possible to submit jobs to run on "sun4m" machines
from "sun4" machines or the reverse. The critical point is that the
jobs are linked with a condor library which has been built on the
same type of machine where you want them to run.
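As a sketch of the config change described above (the ARCH macro name is from the text; the exact file layout may differ by Condor release), the local configuration on each machine might read:

```
##  condor_config.local on a Sparc 10
ARCH = sun4m

##  condor_config.local on a sun4 or sun4c machine
ARCH = sun4
```

With distinct ARCH values, the matchmaker will never pair a job linked on one variety with a machine of the other.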
Problem 2:
Some of the Condor daemons cannot share executable files between the
two kinds of machines. This is due to a difference in the
implementations of the "kvm" library on the two platforms, and affects
not only Condor, but other system programs like "top" and "ps". We
know that the "condor_startd" is affected, but other Condor daemons may
be involved as well. I recommend compiling and distributing completely
separate sets of Condor executables for the two machine types.
The good news is that the incompatibilities exist only at the object
code level. No source code changes are needed.
best regards,
-- mike
==============================================================
Subject: Condor and Unsupported System Calls
From: condor (Mike Litzkow)
Date: Sun, 20 Jun 93 14:42:17 -0600
Friends,
There have been a couple of postings to the group lately regarding programs
that make system calls which aren't supported by Condor. There are a fair
number of such system calls - pipe, fork, exec, socket, and ioctl to name
a few. Quite often the programmer is unaware of using the taboo calls
because they are actually made by some library routine, and the
problem only surfaces when the Condor job fails to run.
In most cases when a user program attempts such a system call, Condor
will place an error message in the program's standard error file, and
then should terminate the job (more on this later). In some cases
though, the system call will simply return an error status, and the
user program is free to react as it wants. We do it this way because
some calls are exercised very frequently by Fortran run time libraries,
and could never complete if Condor terminated them. An example of
such a call is sigvec(). Most Fortran programs will run just fine
under Condor even though at the beginning they do about 20 sigvec()
calls which all fail! In the case of other calls like fork and exec, we feel
it is best to let the user know right away that this program cannot
be run with Condor.
It turns out that Condor_4.1.3b (and possibly other versions as well),
has a bug in how the "job terminating" illegal system calls are handled.
The bug is in the "condor_shadow" program, and a fix is attached. The
bug leads to the program being tried over and over by Condor, even
though it can never run. In many cases the erroneous system call
occurs at the very beginning of the program, which means it will execute
only a short time before encountering the problem. The user runs
"condor_q" many times, but never sees the job in the "Run" state, and
so concludes that Condor is refusing to run the job for some unknown
reason. I recommend that everyone apply the fix because of the large
potential for confusion.
regards,
-- mike
=================================================================
The bug is in the "condor_shadow.c" file around line 466. The sequence
    if( read(pipe,&msg_type,sizeof msg_type) != sizeof msg_type ) {
        EXCEPT( "Could not read msg_type from child shadow" );
    }
should be
    if( read(pipe,&msg_type,sizeof msg_type) != sizeof msg_type ) {
        /* the child probably has died */
        break;
    }
==============================================================
Subject: Bug Fix
From: condor (Mike Litzkow)
Date: Tue, 06 Jul 93 10:27:52 -0600
Friends,
We have discovered a bug in the "condor_schedd". This bug only affects
installations where the condor "log" directory is remotely mounted
via NFS. (All diskless machines will be in this category, but machines
with disks could be set up this way too.)
Description:
In such an installation the Condor daemons need to be able to write
to their respective log files, and should do so by running with an
effective uid of "condor" during all logging operations. In fact all of
the daemons are intended to run with an effective uid of "condor" at
all times except in those few instances when local "root" permission
is required. This is because at most NFS installations, remote accesses
to files by "root" will be handled as if they were attempted by the
unprivileged user "nobody". The daemons do need to run with their
euid's set temporarily to "root" at certain times, for example when sending
signals to local processes. When necessary, the euid is switched between
"root" and "condor" with a pair of routines called "set_condor_euid" and
"set_root_euid". The bug in the "condor_schedd" causes it to run
with an euid of "root" at all times, and it is therefore not able to
access remotely mounted log files.
Fix:
The routines "create_job_queue" and "mark_jobs_idle" both have calls to
"set_condor_euid" near the top and "set_root_euid" near the bottom.
All 4 of these calls should be eliminated.
best regards,
-- mike
=============================================================
Subject: Sun Checkpointing Problem
From: Mike Litzkow <condor@goya.cs.wisc.edu>
Date: Thu, 22 Jul 93 11:31:10 -0500
Friends,
Description:
Some folks have been having a problem with Condor's being unable to
produce a "core" file on sun4m (Sparc-10) systems. This prevents
the job from checkpointing, and has the symptom that the job's
"image size" inexplicably jumps to some unreasonably large value.
Once the image size grows very large, Condor will not be able to
find any machines where the job can run, so it will sit in the
queue forever. So far, the problem has been reported on Sparc-10
systems only, but we are not sure whether it may affect other
Sun systems as well.
Detailed Discussion:
The problem is related to Sun's use of "holes" in their core files. These
are large areas in the file which appear as all zeros when the file is
read, but actually take up no space on the disk. This means that the core
file's virtual size is different from its actual size. In particular the
virtual size of the stack segment that is dumped is the same as the stack
size *limit* in force at the time of the core dump. The actual size of the
dumped stack segment depends on the actual number of pages mapped into the
stack segment at dump time.
The condor starter wants to allow its user processes to use as many
resources as they desire, so it sets all the limits (including the
STACK) to "infinity". When it comes time to dump the core, the kernel
must decide whether there is sufficient free disk, but apparently it
(incorrectly) uses the virtual rather than the actual size of the stack
for this calculation. The result is no core file ever appears. The
condor_starter then concludes that the core file didn't appear because
there was insufficient disk, and adjusts the process's image size
up to the amount of free disk which exists at the time.
Workaround:
The ultimate solution to this problem can only come from Sun, but
you can get things working in most cases by setting a more modest
limit on the user process's stack size. For most processes 8 megabytes
is a reasonable value to use. If you choose this figure, then
any user process which needs more than 8 meg of stack will crash. Also,
checkpointing will still require something over 8 meg of free disk
and will fail on machines with less than that. To set the limit,
change the code in the file "condor_starter/starter.c" as follows:
The line
    limit( RLIMIT_STACK, RLIM_INFINITY );
becomes
    limit( RLIMIT_STACK, 8 * 1024 * 1024 );
best regards,
-- mike
=============================================================
Subject: Condor for HP-UX 9 Available
Date: Mon, 15 Nov 93 10:19:18 -0600
From: condor
Hello,
A port of CONDOR to the HP PA-RISC machines running HPUX 9.01
is now available, and replaces the previous alpha (broken) version.
You can grab a source code tar file via anon ftp from:
ftp.cs.wisc.edu:/condor/Condor_4.1.hp700source.beta1.tar.Z
We hope to soon have another file available which contains
just the HP PA-RISC binaries, to save compiling hassles for folks.
Once it's available, we'll send another message.
Below is a copy of the README.1ST file from the
HP PA-RISC condor file, which details what has been fixed.
I would like to thank everyone who helped out with getting
Condor running on the PA-RISCs. The steady supply of
bug reports and suggested fixes was very helpful. I would
like to especially thank Bret McKee at HP, who helped us
with PA-RISC address queue registers and thus got checkpointing
working properly.
-Todd Tannenbaum, just a guy dealing with condor on HP PA-RISCs
Director of Model Advanced Facility [MAF]
UW-Madison Computer Aided Engineering Center
Here is the README file.
README.1ST:
November 11, 1993
What you have here is the beta.1 source code to Condor 4.1 for the
HP 9000/700 series of workstations running HPUX 9.01.
What is different about this release -vs- the earlier alpha release
of Condor for the HP 700?
- Checkpointing now works properly with no segmentation faults :-)
- This version is written for and tested on HPUX 9.01 only. It has
never been tested on HPUX 8.07, although I _think_ it will work
after a few minor changes to get it to compile (and a few changes to
some of the Imakefiles). Again, don't let the #defines and HPUX8
references everywhere fool you... this release is for HPUX9.01.
- Lots of nonobvious bugs and strange behaviors under HPUX found & fixed.
- Remote time usage reporting should now work correctly
- The MEMORY config file parameter is no longer needed. Condor will now
figure out the amount of RAM installed by reading it out of the kernel.
The MEMORY parameter is only used as a fallback in case of an error.
What still needs to be done to the HP 700 port of Condor?
- Submitting jobs from a platform other than an HP does not work yet.
You must submit your job from an HP 700 in order for it to run properly
on a condor pool of HP 700s. As it turns out, very few Condor sites
care about cross-platform job submitting. I'll hopefully have this
working soon.
- None of the documentation/man pages have been updated yet.
- Although there are #defines everywhere for other platforms, this source
tree will only compile on HPUX. We are beginning to work on folding
the HP source code back into the main Condor platform-independent source.
- The amount of free swap space is still calculated incorrectly. This
is used for optimization, and is not _required_, per se. Expect it to be
fixed in the next release.
Is there an easier way to figure out how to compile my job for condor?
Here are a few ideas:
(1) try out the condor_compile command, located in the condor_compile
subdirectory. Read the README located there. After installing
condor_compile, you can compile for condor by typing:
condor_compile <whatever you normally type to create an executable>
For instance, if you normally compile by typing "f77 +e +O3 myprog.f",
typing "condor_compile f77 +e +O3 myprog.f" will result in a
condor executable called a.out.condor. (condor_compile will append
a ".condor" to whatever your executable name normally comes out as).
You can type "condor_compile make myprogram", or "condor_compile
cc .....", whatever. condor_compile currently works on HPUX and
on SunOS 4.1.x. The whole idea behind condor_compile is a rather
ugly hack, but condor end-users who just want to use Condor without
a lecture on linking love it.
(2) HPUX9.01 supports the "-v" option on most compilers, which displays
all the options being passed to ld. Link for condor with the
exact same options you see when you compile with "-v", but replace
crt0 and -lc with the condor versions.
(3) Examine the Makefile in the test suite directories.
Enjoy! We currently have about 80 HP 700s in our pool, and it blows the
doors off of our old Sun SPARC 1 pool. Happy crunching.
Todd Tannenbaum
Director of the Model Advanced Facility (MAF)
University of Wisconsin-Madison Computer Aided Engineering Center
Questions/comments/problems/bugs with CONDOR in general?
send internet email to: condor@cs.wisc.edu
Questions/comments/problems/bugs *specific to the HP 700 port* of CONDOR?
send internet email to: tannenba@engr.wisc.edu
=============================================================
Subject: Condor Job Termination Reports
Date: Thu, 03 Mar 94 11:21:24 -0600
From: condor
A number of folks have asked questions regarding the meanings of the
various job termination messages generated by Condor. Often folks have
thought that the exit status and termination signal numbers are
generated by Condor, and have asked "where in the Condor documentation
are these listed?". In fact Condor is only reporting to you the
information about how your job terminated which is made available by
the underlying operating system (Unix).
Following are a few tips on understanding process termination which
may be helpful to those of you not already intimate with these
details.
best regards,
-- mike
To understand this information, you first need to know that every Unix
process will terminate in one of two ways - "normally", or
"abnormally".
Normal termination
A process is said to terminate "normally" when it calls the exit()
function, or when the function main() returns. In either case it
is possible for the application programmer to provide a number
called the "exit status". If your program terminates by calling
exit(), then the status is an integer argument to that function.
If your program terminates by reaching the end of main(), then the
status is the return value of that function. For example:
exit( 0 );
or
return 0;
One aspect which is sometimes confusing is what happens if your
application fails to provide a status value at exit time by calling
exit() with no arguments, reaching the end of main() with a return
statement with no value, or reaching the end of main() with no
return statement. In such a case, the exit status is "undefined" -
in other words some value will be reported, but it is meaningless.
Another aspect which is sometimes confusing is the size of the exit
status. In general only 8 bits are allowed for this purpose. On
most platforms you can then think of the exit status as an unsigned
char, i.e. it can only hold values [0 - 255]. A common mistake is
calling "exit( -1 )" in case of an error. The exit status in this
case will be reported as 255!
When your program exits normally the message from Condor will
look something like
Your Condor job
a.out arg_1 arg_2 arg_3
exited with status 73.
In such a case, your system administrator cannot answer the
question "what does exit status 73 mean?". That is the exit status
returned by your code, which as discussed above, may or may not be
meaningful.
There are a couple of special cases to consider. First you should
realize that your program contains a mixture of your own code,
Condor supplied code, and other library code. If some unexpected
event causes your program to exit while executing the Condor
supplied code, the exit status is generally 4. Also some versions
of Condor take the exit status 255 (remember -1), to have special
meaning. We recommend that your code always provide a meaningful
exit status, and that the values 4 and 255 not be used for this
purpose. (It is traditional to return a status of zero when a
program terminates correctly, and non-zero when it exits with an
error.)
Finally, you may wonder why you don't see these "exit status"
numbers when you run your job outside of Condor. In fact they do
exist, and can be found in the shell variable $status (in csh;
Bourne-style shells report the same value in $?), which is
updated after every shell command. For example
will run the "ls" command, and then print its exit status.
Abnormal Termination
A program is said to have terminated "abnormally" if it is killed
by being sent one of a set of signals which causes termination, and
that signal is not being blocked, caught, or ignored. In many
cases these "terminating" signals will also cause a core file to be
generated. Such an untimely death can happen to both Condor and
non-Condor processes, and is generally reported by the shell - for
example the ever popular
Bus error (core dumped)
Note that the core dump will not happen if you have set a
"coredumpsize" limit too small to allow it, or if your file system
doesn't have enough space.
When your Condor process is killed by such a signal the message
will look something like
Your Condor job
a.out arg_1 arg_2 arg_3
was killed by signal 10.
There may also be a clause telling you that you have a core dump,
and the name of the core file. The core file will not appear if
there was insufficient file system space on either the executing
machine or your machine, or you have asked not to get core files
(see condor_submit(1) for details). In this case Condor reports
the signal as "number 10", but does not translate the number to the
string "Bus error". This is because such translation is generally
not portable across various Unix implementations. The meanings of
the signals are defined in the header file <signal.h>.
Note that the core file may be useful in determining what caused
the untimely death of your process, and in particular whether it
was executing your code or Condor code at the time of the event.
It would thus be bad form to remove the core right before asking
your Condor system administrator for help in determining what went
wrong.